In [1]:
import pandas as pd
import numpy as np
In [4]:
df=pd.read_csv('C:/Users/Haytham/Desktop/Story Of Films-Recommandation/movies_metadata.csv')
df.head()
Out[4]:
adult belongs_to_collection budget genres homepage id imdb_id original_language original_title overview ... release_date revenue runtime spoken_languages status tagline title video vote_average vote_count
0 False {'id': 10194, 'name': 'Toy Story Collection', ... 30000000 [{'id': 16, 'name': 'Animation'}, {'id': 35, '... http://toystory.disney.com/toy-story 862 tt0114709 en Toy Story Led by Woody, Andy's toys live happily in his ... ... 1995-10-30 373554033.0 81.0 [{'iso_639_1': 'en', 'name': 'English'}] Released NaN Toy Story False 7.7 5415.0
1 False NaN 65000000 [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... NaN 8844 tt0113497 en Jumanji When siblings Judy and Peter discover an encha... ... 1995-12-15 262797249.0 104.0 [{'iso_639_1': 'en', 'name': 'English'}, {'iso... Released Roll the dice and unleash the excitement! Jumanji False 6.9 2413.0
2 False {'id': 119050, 'name': 'Grumpy Old Men Collect... 0 [{'id': 10749, 'name': 'Romance'}, {'id': 35, ... NaN 15602 tt0113228 en Grumpier Old Men A family wedding reignites the ancient feud be... ... 1995-12-22 0.0 101.0 [{'iso_639_1': 'en', 'name': 'English'}] Released Still Yelling. Still Fighting. Still Ready for... Grumpier Old Men False 6.5 92.0
3 False NaN 16000000 [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam... NaN 31357 tt0114885 en Waiting to Exhale Cheated on, mistreated and stepped on, the wom... ... 1995-12-22 81452156.0 127.0 [{'iso_639_1': 'en', 'name': 'English'}] Released Friends are the people who let you be yourself... Waiting to Exhale False 6.1 34.0
4 False {'id': 96871, 'name': 'Father of the Bride Col... 0 [{'id': 35, 'name': 'Comedy'}] NaN 11862 tt0113041 en Father of the Bride Part II Just when George Banks has recovered from his ... ... 1995-02-10 76578911.0 106.0 [{'iso_639_1': 'en', 'name': 'English'}] Released Just When His World Is Back To Normal... He's ... Father of the Bride Part II False 5.7 173.0

5 rows × 24 columns

1- Understanding the Dataset

The dataset above was obtained through the TMDB API. The movies in this dataset correspond to those listed in the MovieLens Latest Full Dataset, which comprises 26 million ratings on 45,000 movies from 270,000 users. Let us have a look at the features that are available to us.

In [5]:
df.columns
Out[5]:
Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')
In [4]:
df.shape
Out[4]:
(45466, 24)
In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
adult                    45466 non-null object
belongs_to_collection    4494 non-null object
budget                   45466 non-null object
genres                   45466 non-null object
homepage                 7782 non-null object
id                       45466 non-null object
imdb_id                  45449 non-null object
original_language        45455 non-null object
original_title           45466 non-null object
overview                 44512 non-null object
popularity               45461 non-null object
poster_path              45080 non-null object
production_companies     45463 non-null object
production_countries     45463 non-null object
release_date             45379 non-null object
revenue                  45460 non-null float64
runtime                  45203 non-null float64
spoken_languages         45460 non-null object
status                   45379 non-null object
tagline                  20412 non-null object
title                    45460 non-null object
video                    45460 non-null object
vote_average             45460 non-null float64
vote_count               45460 non-null float64
dtypes: float64(4), object(20)
memory usage: 8.3+ MB

There are a total of 45,466 movies with 24 features. Most of the features have very few NaN values (apart from belongs_to_collection, homepage and tagline). We will attempt to clean this dataset into a form suitable for analysis in the next section.
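To put a number on "very few NaN values", the share of missing entries per column can be computed directly. The following is a minimal sketch on a toy frame; the values in `toy` are hypothetical stand-ins, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the movies table (hypothetical values, not real rows).
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'homepage': [np.nan, 'http://example.com', np.nan, np.nan],
    'tagline': ['x', np.nan, np.nan, 'y'],
})

# isnull() marks missing cells; the column mean is the missing share.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to df, the same expression would rank the columns by how sparse they are.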

2- Data Wrangling and Data preprocessing

The data was originally obtained in the form of a JSON file. This was converted manually into a CSV file to arrive at an input that could be loaded into a Pandas DataFrame effortlessly. In other words, the dataset we have in our hands is already relatively clean. We will, however, try to learn more about our features and perform appropriate wrangling steps to arrive at a form that is more suitable for analysis.

Let us start by removing the features that are not useful to us.

In [6]:
df = df.drop('imdb_id', axis=1)

The original title refers to the title of the movie in the native language in which the movie was shot. As such, I will prefer using the translated, Anglicized name in this analysis and hence, will drop the original titles altogether. We will be able to deduce if the movie is a foreign language film by looking at the original_language feature so no tangible information is lost in doing so.

In [7]:
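# drop() returns a new DataFrame; the result is only displayed here, so df still holds original_title.
# Use df = df.drop('original_title', axis=1) to actually remove the column.
df.drop('original_title', axis=1)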
Out[7]:
adult belongs_to_collection budget genres homepage id original_language overview popularity poster_path ... release_date revenue runtime spoken_languages status tagline title video vote_average vote_count
0 False {'id': 10194, 'name': 'Toy Story Collection', ... 30000000 [{'id': 16, 'name': 'Animation'}, {'id': 35, '... http://toystory.disney.com/toy-story 862 en Led by Woody, Andy's toys live happily in his ... 21.9469 /rhIRbceoE9lR4veEXuwCC2wARtG.jpg ... 1995-10-30 373554033.0 81.0 [{'iso_639_1': 'en', 'name': 'English'}] Released NaN Toy Story False 7.7 5415.0
1 False NaN 65000000 [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... NaN 8844 en When siblings Judy and Peter discover an encha... 17.0155 /vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg ... 1995-12-15 262797249.0 104.0 [{'iso_639_1': 'en', 'name': 'English'}, {'iso... Released Roll the dice and unleash the excitement! Jumanji False 6.9 2413.0
2 False {'id': 119050, 'name': 'Grumpy Old Men Collect... 0 [{'id': 10749, 'name': 'Romance'}, {'id': 35, ... NaN 15602 en A family wedding reignites the ancient feud be... 11.7129 /6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg ... 1995-12-22 0.0 101.0 [{'iso_639_1': 'en', 'name': 'English'}] Released Still Yelling. Still Fighting. Still Ready for... Grumpier Old Men False 6.5 92.0
3 False NaN 16000000 [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam... NaN 31357 en Cheated on, mistreated and stepped on, the wom... 3.85949 /16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg ... 1995-12-22 81452156.0 127.0 [{'iso_639_1': 'en', 'name': 'English'}] Released Friends are the people who let you be yourself... Waiting to Exhale False 6.1 34.0
4 False {'id': 96871, 'name': 'Father of the Bride Col... 0 [{'id': 35, 'name': 'Comedy'}] NaN 11862 en Just when George Banks has recovered from his ... 8.38752 /e64sOI48hQXyru7naBFyssKFxVd.jpg ... 1995-02-10 76578911.0 106.0 [{'iso_639_1': 'en', 'name': 'English'}] Released Just When His World Is Back To Normal... He's ... Father of the Bride Part II False 5.7 173.0
5 False NaN 60000000 [{'id': 28, 'name': 'Action'}, {'id': 80, 'nam... NaN 949 en Obsessive master thief, Neil McCauley leads a ... 17.9249 /zMyfPUelumio3tiDKPffaUpsQTD.jpg ... 1995-12-15 187436818.0 170.0 [{'iso_639_1': 'en', 'name': 'English'}, {'iso... Released A Los Angeles Crime Saga Heat False 7.7 1886.0
6 False NaN 58000000 [{'id': 35, 'name': 'Comedy'}, {'id': 10749, '... NaN 11860 en An ugly duckling having undergone a remarkable... 6.67728 /jQh15y5YB7bWz1NtffNZmRw0s9D.jpg ... 1995-12-15 0.0 127.0 [{'iso_639_1': 'fr', 'name': 'Français'}, {'is... Released You are cordially invited to the most surprisi... Sabrina False 6.2 141.0
7 False NaN 0 [{'id': 28, 'name': 'Action'}, {'id': 12, 'nam... NaN 45325 en A mischievous young boy, Tom Sawyer, witnesses... 2.56116 /sGO5Qa55p7wTu7FJcX4H4xIVKvS.jpg ... 1995-12-22 0.0 97.0 [{'iso_639_1': 'en', 'name': 'English'}, {'iso... Released The Original Bad Boys. Tom and Huck False 5.4 45.0
8 False NaN 35000000 [{'id': 28, 'name': 'Action'}, {'id': 12, 'nam... NaN 9091 en International action superstar Jean Claude Van... 5.23158 /eoWvKD60lT95Ss1MYNgVExpo5iU.jpg ... 1995-12-22 64350171.0 106.0 [{'iso_639_1': 'en', 'name': 'English'}] Released Terror goes into overtime. Sudden Death False 5.5 174.0
9 False {'id': 645, 'name': 'James Bond Collection', '... 58000000 [{'id': 12, 'name': 'Adventure'}, {'id': 28, '... http://www.mgm.com/view/movie/757/Goldeneye/ 710 en James Bond must unmask the mysterious head of ... 14.686 /5c0ovjT41KnYIHYuF4AWsTe3sKh.jpg ... 1995-11-16 352194034.0 130.0 [{'iso_639_1': 'en', 'name': 'English'}, {'iso... Released No limits. No fears. No substitutes. GoldenEye False 6.6 1194.0
10 False NaN 62000000 [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam... NaN 9087 en Widowed U.S. president Andrew Shepherd, one of... 6.31844 /lymPNGLZgPHuqM29rKMGV46ANij.jpg ... 1995-11-17 107879496.0 106.0 [{'iso_639_1': 'en', 'name': 'English'}] Released Why can't the most powerful man in the world h... The American President False 6.5 199.0
11 False NaN 0 [{'id': 35, 'name': 'Comedy'}, {'id': 27, 'nam... NaN 12110 en When a lawyer shows up at the vampire's doorst... 5.43033 /xve4cgfYItnOhtzLYoTwTVy5FGr.jpg ... 1995-12-22 0.0 88.0 [{'iso_639_1': 'en', 'name': 'English'}, {'iso... Released NaN Dracula: Dead and Loving It False 5.7 210.0
12 False {'id': 117693, 'name': 'Balto Collection', 'po... 0 [{'id': 10751, 'name': 'Family'}, {'id': 16, '... NaN 21032 en An outcast half-wolf risks his life to prevent... 12.1407 /gV5PCAVCPNxlOLFM1bKk50EqLXO.jpg ... 1995-12-22 11348324.0 78.0 [{'iso_639_1': 'en', 'name': 'English'}] Released Part Dog. Part Wolf. All Hero. Balto False 7.1 423.0
13 False NaN 44000000 [{'id': 36, 'name': 'History'}, {'id': 18, 'na... NaN 10858 en An all-star cast powers this epic look at Amer... 5.092 /cICkmCEiXRhvZmbuAlsA5D9B2rK.jpg ... 1995-12-22 13681765.0 192.0 [{'iso_639_1': 'en', 'name': 'English'}] Released Triumphant in Victory, Bitter in Defeat. He Ch... Nixon False 7.1 72.0
14 False NaN 98000000 [{'id': 28, 'name': 'Action'}, {'id': 12, 'nam... NaN 1408 en Morgan Adams and her slave, William Shaw, are ... 7.28448 /odM9973kIv9hcjfHPp6g6BlyTIJ.jpg ... 1995-12-22 10017322.0 119.0 [{'iso_639_1': 'en', 'name': 'English'}, {'iso... Released The Course Has Been Set. There Is No Turning B... Cutthroat Island False 5.7 137.0
15 False NaN 52000000 [{'id': 18, 'name': 'Drama'}, {'id': 80, 'name... NaN 524 en The life of the gambling paradise – Las Vegas ... 10.1374 /xo517ibXBDdYQY81j0WIG7BVcWq.jpg ... 1995-11-22 116112375.0 178.0 [{'iso_639_1': 'en', 'name': 'English'}] Released No one stays at the top forever. Casino False 7.8 1343.0
16 False NaN 16500000 [{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n... NaN 4584 en Rich Mr. Dashwood dies, leaving his second wif... 10.6732 /lA9HTy84Bb6ZwNeyoZKobcMdpMc.jpg ... 1995-12-13 135000000.0 136.0 [{'iso_639_1': 'en', 'name': 'English'}] Released Lose your heart and come to your senses. Sense and Sensibility False 7.2 364.0
17 False NaN 4000000 [{'id': 80, 'name': 'Crime'}, {'id': 35, 'name... NaN 5 en It's Ted the Bellhop's first night on the job.... 9.02659 /eQs5hh9rxrk1m4xHsIz1w11Ngqb.jpg ... 1995-12-09 4300000.0 98.0 [{'iso_639_1': 'en', 'name': 'English'}] Released Twelve outrageous guests. Four scandalous requ... Four Rooms False 6.5 539.0
18 False {'id': 3167, 'name': 'Ace Ventura Collection',... 30000000 [{'id': 80, 'name': 'Crime'}, {'id': 35, 'name... NaN 9273 en Summoned from an ashram in Tibet, Ace finds hi... 8.20545 /wRlGnJhEzcxBjvWtvbjhDSU1cIY.jpg ... 1995-11-10 212385533.0 90.0 [{'iso_639_1': 'en', 'name': 'English'}] Released New animals. New adventures. Same hair. Ace Ventura: When Nature Calls False 6.1 1128.0
19 False NaN 60000000 [{'id': 28, 'name': 'Action'}, {'id': 35, 'nam... NaN 11517 en A vengeful New York transit cop decides to ste... 7.33791 /jSozzzVOR2kfXgTUuGnbgG2yRFi.jpg ... 1995-11-21 35431113.0 103.0 [{'iso_639_1': 'en', 'name': 'English'}] Released Get on, or GET OUT THE WAY! Money Train False 5.4 224.0
20 False {'id': 91698, 'name': 'Chili Palmer Collection... 30250000 [{'id': 35, 'name': 'Comedy'}, {'id': 53, 'nam... NaN 8012 en Chili Palmer is a Miami mobster who gets sent ... 12.6696 /vWtDUUgQAsVyvRW4mE75LBgVm2e.jpg ... 1995-10-20 115101622.0 105.0 [{'iso_639_1': 'en', 'name': 'English'}] Released The mob is tough, but it’s nothing like show b... Get Shorty False 6.4 305.0
21 False NaN 0 [{'id': 18, 'name': 'Drama'}, {'id': 53, 'name... NaN 1710 en An agoraphobic psychologist and a female detec... 10.7018 /80czeJGSoik22fhtUM9WzyjUU4r.jpg ... 1995-10-27 0.0 124.0 [{'iso_639_1': 'en', 'name': 'English'}, {'iso... Released One man is copying the most notorious killers ... Copycat False 6.5 199.0
22 False NaN 50000000 [{'id': 28, 'name': 'Action'}, {'id': 12, 'nam... NaN 9691 en Assassin Robert Rath arrives at a funeral to k... 11.0659 /xAx5MP7Dg4y85pyS7atX6eWk4Qd.jpg ... 1995-10-06 30303072.0 132.0 [{'iso_639_1': 'en', 'name': 'English'}, {'iso... Released In the shadows of life, In the business of dea... Assassins False 6.0 394.0
23 False NaN 0 [{'id': 18, 'name': 'Drama'}, {'id': 14, 'name... NaN 12665 en Harassed by classmates who won't accept his sh... 12.1331 /1uRKsxOCtgz0xVqs9l4hYtp4dFm.jpg ... 1995-10-27 0.0 111.0 [{'iso_639_1': 'en', 'name': 'English'}] Released An extraordinary encounter with another human ... Powder False 6.3 143.0
24 False NaN 3600000 [{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n... http://www.mgm.com/title_title.do?title_star=L... 451 en Ben Sanderson, an alcoholic Hollywood screenwr... 10.332 /37qHRJxnSh5YkuaN9FgfNnMl3Tj.jpg ... 1995-10-27 49800000.0 112.0 [{'iso_639_1': 'en', 'name': 'English'}] Released I Love You... The Way You Are. Leaving Las Vegas False 7.1 365.0
25 False NaN 0 [{'id': 18, 'name': 'Drama'}] NaN 16420 en The evil Iago pretends to be friend of Othello... 1.8459 /qM0BXEQjmnAzlkDZ0tYmV6twqMX.jpg ... 1995-12-15 0.0 123.0 [{'iso_639_1': 'en', 'name': 'English'}] Released Envy, greed, jealousy and love. Othello False 7.0 33.0
26 False NaN 12000000 [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam... NaN 9263 en Waxing nostalgic about the bittersweet passage... 8.68132 /wD6rLdD2Ix3u9YLgE3Do8GyCHoz.jpg ... 1995-10-20 27400000.0 100.0 [{'iso_639_1': 'en', 'name': 'English'}] Released In every woman there is the girl she left behind. Now and Then False 6.6 91.0
27 False NaN 0 [{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n... NaN 17015 en This film adaptation of Jane Austen's last nov... 2.22843 /si8911IezMvAnQFDvyg1nKzDlD.jpg ... 1995-09-27 0.0 104.0 [{'iso_639_1': 'en', 'name': 'English'}] Released NaN Persuasion False 7.4 36.0
28 False NaN 18000000 [{'id': 14, 'name': 'Fantasy'}, {'id': 878, 'n... NaN 902 fr A scientist in a surrealist society kidnaps ch... 9.82242 /eVo6ewq4akfyJYy3GXkMsLNzEJc.jpg ... 1995-05-16 1738611.0 108.0 [{'iso_639_1': 'cn', 'name': '广州话 / 廣州話'}, {'i... Released Where happily ever after is just a dream. The City of Lost Children False 7.6 308.0
29 False NaN 0 [{'id': 18, 'name': 'Drama'}, {'id': 80, 'name... NaN 37557 zh A provincial boy related to a Shanghai crime f... 1.10091 /qcoOCoN7viOhboGwhYXyApdDuiq.jpg ... 1995-04-30 0.0 108.0 [{'iso_639_1': 'zh', 'name': '普通话'}] Released In 1930's Shanghai violence was not the proble... Shanghai Triad False 6.5 17.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
45436 False NaN 0 [{'id': 28, 'name': 'Action'}, {'id': 9648, 'n... NaN 45527 en A stranger named Silas flees from a devastatin... 1.270832 /sEP5bLtK5IyQxOFtq5AYz4kUpzD.jpg ... 2010-01-01 0.0 92.0 [{'iso_639_1': 'en', 'name': 'English'}] Released Action, Horror The Final Storm False 3.7 11.0
45437 False NaN 0 [{'id': 10751, 'name': 'Family'}, {'id': 16, '... NaN 455661 en A closeted boy runs the risk of being outed by... 20.82178 /wJUJROdLOtOzMixkjkx1aaZGSLl.jpg ... 2017-06-01 0.0 4.0 [{'iso_639_1': 'en', 'name': 'English'}] Released The Heart Wants What The Heart Wants In a Heartbeat False 8.3 146.0
45438 False NaN 0 [{'id': 18, 'name': 'Drama'}] NaN 327237 nl Bloed, Zweet en Tranen (Blood, Sweat and Tears... 0.590087 /aNrh6DfTngUvi7WdpTjLTPzYbTU.jpg ... 2015-04-02 0.0 111.0 [{'iso_639_1': 'nl', 'name': 'Nederlands'}] Released The movie about Andre Hazes Blood, Sweat and Tears False 6.8 11.0
45439 False NaN 0 [{'id': 18, 'name': 'Drama'}, {'id': 14, 'name... NaN 84710 en Paula the ape woman (Acquanetta) is alive and ... 0.143223 /udZ3hvhC52su7UQrGsIyDxK7IQA.jpg ... 1944-06-01 0.0 61.0 [{'iso_639_1': 'en', 'name': 'English'}] Released RAPTUROUS BEAUTY!...FURY OF A BEAST! Jungle Woman False 5.8 2.0
45440 False NaN 0 [{'id': 18, 'name': 'Drama'}, {'id': 10751, 'n... NaN 39562 en Pretty, popular, and slim high-schooler Aly Sc... 0.767762 /w1TxeDOMS5so83bECaQECW1pV9.jpg ... 2007-01-08 0.0 89.0 [{'iso_639_1': 'en', 'name': 'English'}] Released NaN To Be Fat Like Me False 5.0 12.0
45441 False NaN 0 [{'id': 35, 'name': 'Comedy'}] NaN 14008 en Hyperactive teenager Kelly is enrolled into a ... 4.392389 /9sQVYo7B9cSexWXoQVzFFijCbD2.jpg ... 2002-03-07 0.0 101.0 [{'iso_639_1': 'en', 'name': 'English'}] Released Too Cool For The Rules! Cadet Kelly False 5.2 145.0
45442 False NaN 0 [] NaN 44330 en A combination gambling den and bawdy house is ... 0.21926 /6MwBolpHcddjB3JQyNo2AfB8e3L.jpg ... 1905-01-01 0.0 3.0 [] NaN NaN The Scheming Gambler's Paradise False 5.0 3.0
45443 False NaN 0 [{'id': 35, 'name': 'Comedy'}, {'id': 14, 'nam... NaN 49279 fr A chemist in his laboratory places upon a tabl... 1.618458 /9oqTGZcQf0UwsbA8mKPeJzPRJ5e.jpg ... 1901-01-01 0.0 3.0 [] Released NaN The Man with the Rubber Head False 7.6 29.0
45444 False NaN 0 [{'id': 14, 'name': 'Fantasy'}] NaN 44333 fr A bearded magician holds up a large playing ca... 0.208349 /v7V9SA14gmznzpvxbKCtA4BzQ7z.jpg ... 1905-01-01 0.0 3.0 [{'iso_639_1': 'xx', 'name': 'No Language'}] Released NaN The Living Playing Cards False 6.1 8.0
45445 False NaN 0 [{'id': 14, 'name': 'Fantasy'}, {'id': 35, 'na... NaN 49277 en A wall full of advertising posters comes to life. 0.148131 /nLVgY3fYMnRbRzEoz3fF0wrqtCr.jpg ... 1906-01-01 0.0 3.0 [{'iso_639_1': 'fr', 'name': 'Français'}, {'is... Released NaN The Hilarious Posters False 4.5 2.0
45446 False NaN 0 [{'id': 14, 'name': 'Fantasy'}, {'id': 35, 'na... NaN 49271 en A man rents an apartment and furnishes it in r... 0.725084 /tX3x4Jms6xqVgiet0Ei2Ks0XkAg.jpg ... 1909-01-01 0.0 6.0 [] Released NaN The Devilish Tenant False 6.7 12.0
45447 False NaN 0 [] NaN 44324 fr The background of this picture represents a sc... 0.213973 /ifWvveLPWWrzitz4oL01YMEokiB.jpg ... 1904-03-05 0.0 3.0 [] Released NaN The Untameable Whiskers False 6.0 6.0
45448 False NaN 0 [] NaN 122036 fr This shows a prince entering upon the stage of... 0.071782 /ifWvveLPWWrzitz4oL01YMEokiB.jpg ... 1904-01-01 0.0 2.0 [] Released NaN The Imperceptable Transmutations False 5.0 2.0
45449 False NaN 0 [{'id': 16, 'name': 'Animation'}, {'id': 10751... NaN 14885 en It's Halloween in the 100 Acre Wood, and Roo's... 2.568495 /yU6byKhpzHG1RFpY5PYM6mfhfU1.jpg ... 2005-09-13 0.0 67.0 [{'iso_639_1': 'en', 'name': 'English'}] Released Celebrate Lumpy's First Halloween. Pooh's Heffalump Halloween Movie False 5.4 7.0
45450 False NaN 0 [{'id': 14, 'name': 'Fantasy'}, {'id': 28, 'na... NaN 49280 fr A band-leader has arranged seven chairs for th... 1.109068 /ZLOgI7KjtWby1NEg2pjU2Id60W.jpg ... 1900-01-01 0.0 1.0 [{'iso_639_1': 'xx', 'name': 'No Language'}] Released NaN The One-Man Band False 6.5 22.0
45451 False NaN 0 [{'id': 35, 'name': 'Comedy'}, {'id': 14, 'nam... NaN 106807 fr A series of fantastical wrestling matches. 0.225432 /tn5J3PszVETDkKj49nakmeqr8Od.jpg ... 1900-01-01 0.0 2.0 [{'iso_639_1': 'xx', 'name': 'No Language'}] Released NaN The Fat and Lean Wrestling Match False 6.5 6.0
45452 False NaN 0 [{'id': 99, 'name': 'Documentary'}] NaN 276895 en Deep Hearts is a film about the Bororo Fulani,... 0.011025 /8jI4ykkIVDmrYgUjDld9i0aulMq.jpg ... 1981-01-01 0.0 58.0 [{'iso_639_1': 'ff', 'name': 'Fulfulde'}, {'is... Released NaN Deep Hearts False 0.0 0.0
45453 False NaN 0 [{'id': 80, 'name': 'Crime'}, {'id': 18, 'name... NaN 404604 hi The bliss of a biology teacher’s family life i... 1.559596 /zZwbntqdfKdVgzH1RoMHa99g0mJ.jpg ... 2017-07-07 0.0 146.0 [{'iso_639_1': 'hi', 'name': 'हिन्दी'}] Released NaN Mom False 6.6 14.0
45454 False NaN 0 [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam... NaN 420346 en The Morning After is a feature film that consi... 0.139936 /rpkDqyKdXahRcZIEC9I02EBSwje.jpg ... 2015-01-11 0.0 79.0 [{'iso_639_1': 'en', 'name': 'English'}] Released What happened last night? The Morning After False 4.0 2.0
45455 False NaN 0 [] NaN 67179 it Sentenced to life imprisonment for illegal act... 0.225051 /j1AN0L4motTt8SBxeTMXDtExsYl.jpg ... 1972-01-01 0.0 90.0 [{'iso_639_1': 'it', 'name': 'Italiano'}] Released NaN St. Michael Had a Rooster False 6.0 3.0
45456 False NaN 0 [{'id': 27, 'name': 'Horror'}, {'id': 9648, 'n... NaN 84419 en An unsuccessful sculptor saves a madman named ... 0.222814 /yMnq9mL5uYxbRgwKqyz1cVGCJYJ.jpg ... 1946-03-29 0.0 65.0 [{'iso_639_1': 'en', 'name': 'English'}] Released Meet...The CREEPER! House of Horrors False 6.3 8.0
45457 False NaN 0 [{'id': 9648, 'name': 'Mystery'}, {'id': 27, '... NaN 390959 en In this true-crime documentary, we delve into ... 0.076061 /q75tCM4pFmUzdCg0gqcOQquCaYf.jpg ... 2000-10-22 0.0 45.0 [{'iso_639_1': 'en', 'name': 'English'}] Released NaN Shadow of the Blair Witch False 7.0 2.0
45458 False NaN 0 [{'id': 27, 'name': 'Horror'}] NaN 289923 en A film archivist revisits the story of Rustin ... 0.38645 /lXtoHVdej6kS1Dc7KAhw05sMos9.jpg ... 2000-10-03 0.0 30.0 [{'iso_639_1': 'en', 'name': 'English'}] Released Do you know what happened 50 years before "The... The Burkittsville 7 False 7.0 1.0
45459 False NaN 0 [{'id': 878, 'name': 'Science Fiction'}] NaN 222848 en It's the year 3000 AD. The world's most danger... 0.661558 /4lF9LH0b0Z1X94xGK9IOzqEW6k1.jpg ... 1995-01-01 0.0 85.0 [{'iso_639_1': 'en', 'name': 'English'}] Released NaN Caged Heat 3000 False 3.5 1.0
45460 False NaN 0 [{'id': 18, 'name': 'Drama'}, {'id': 28, 'name... NaN 30840 en Yet another version of the classic epic, with ... 5.683753 /fQC46NglNiEMZBv5XHoyLuOWoN5.jpg ... 1991-05-13 0.0 104.0 [{'iso_639_1': 'en', 'name': 'English'}] Released NaN Robin Hood False 5.7 26.0
45461 False NaN 0 [{'id': 18, 'name': 'Drama'}, {'id': 10751, 'n... http://www.imdb.com/title/tt6209470/ 439050 fa Rising and falling between a man and woman. 0.072051 /jldsYflnId4tTWPx8es3uzsB1I8.jpg ... NaN 0.0 90.0 [{'iso_639_1': 'fa', 'name': 'فارسی'}] Released Rising and falling between a man and woman Subdue False 4.0 1.0
45462 False NaN 0 [{'id': 18, 'name': 'Drama'}] NaN 111109 tl An artist struggles to finish his work while a... 0.178241 /xZkmxsNmYXJbKVsTRLLx3pqGHx7.jpg ... 2011-11-17 0.0 360.0 [{'iso_639_1': 'tl', 'name': ''}] Released NaN Century of Birthing False 9.0 3.0
45463 False NaN 0 [{'id': 28, 'name': 'Action'}, {'id': 18, 'nam... NaN 67758 en When one of her hits goes wrong, a professiona... 0.903007 /d5bX92nDsISNhu3ZT69uHwmfCGw.jpg ... 2003-08-01 0.0 90.0 [{'iso_639_1': 'en', 'name': 'English'}] Released A deadly game of wits. Betrayal False 3.8 6.0
45464 False NaN 0 [] NaN 227506 en In a small town live two brothers, one a minis... 0.003503 /aorBPO7ak8e8iJKT5OcqYxU3jlK.jpg ... 1917-10-21 0.0 87.0 [] Released NaN Satan Triumphant False 0.0 0.0
45465 False NaN 0 [] NaN 461257 en 50 years after decriminalisation of homosexual... 0.163015 /s5UkZt6NTsrS7ZF0Rh8nzupRlIU.jpg ... 2017-06-09 0.0 75.0 [{'iso_639_1': 'en', 'name': 'English'}] Released NaN Queerama False 0.0 0.0

45466 rows × 22 columns
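Note that DataFrame.drop returns a new frame rather than modifying the original. The call above displays its result without assigning it, which is consistent with the 23-column shapes reported in the following cells. A minimal sketch of the difference, using a hypothetical one-row frame:

```python
import pandas as pd

# Hypothetical one-row frame used only to illustrate drop() semantics.
toy = pd.DataFrame({'title': ['Toy Story'], 'original_title': ['Toy Story']})

# Without assignment the original frame is left untouched.
toy.drop('original_title', axis=1)
cols_after_bare_call = list(toy.columns)

# Assigning the result is what actually removes the column.
toy = toy.drop('original_title', axis=1)
cols_after_assignment = list(toy.columns)

print(cols_after_bare_call, cols_after_assignment)
```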

In [8]:
df[df['revenue'] == 0].shape
Out[8]:
(38052, 23)

We see that the majority of the movies have a recorded revenue of 0. This indicates that we do not have information about the total revenue for these movies. Although they form the majority of the movies available to us, we will still use revenue as an extremely important feature going forward, based on the remaining roughly 7,400 movies.

In [9]:
df['revenue'] = df['revenue'].replace(0, np.nan)

The budget feature has some unclean values that make Pandas assign it the generic object type. We proceed to convert it into a numeric variable and replace all the non-numeric values with NaN. Finally, as with revenue, we will replace all the values of 0 with NaN to indicate the absence of information regarding budget.

In [10]:
df['budget']=pd.to_numeric(df['budget'],errors='coerce')
df['budget']=df['budget'].replace(0,np.nan)
df[df['budget'].isnull()].shape
Out[10]:
(36576, 23)
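The two cleaning steps in the cell above can be seen in isolation on a small series. This is a sketch only; the entries in `raw_budget`, including the non-numeric one, are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical raw values: two valid budgets, a zero, and a non-numeric string.
raw_budget = pd.Series(['30000000', '0', 'poster.jpg', '65000000'])

budget = pd.to_numeric(raw_budget, errors='coerce')  # non-numeric text becomes NaN
budget = budget.replace(0, np.nan)                   # 0 is treated as "unknown"

n_missing = int(budget.isnull().sum())
print(budget.tolist(), n_missing)
```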

As we move forward trying to answer certain questions, we will have to construct several features suitable for that particular query. For now, we will construct two very important features:

year: The year in which the movie was released.

return: The ratio of revenue to budget.

The return feature is extremely insightful as it will give us a more accurate picture of the financial success of a movie. Presently, our data will not be able to judge if a $200 million budget movie that earned $100 million did better than a $50,000 budget movie taking in $200,000. This feature will be able to capture that information.

A return value > 1 would indicate profit whereas a return value < 1 would indicate a loss.
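The two movies from the example above make the point concrete. A small sketch, where the index labels `big_budget` and `small_budget` are just illustrative names:

```python
import pandas as pd

# The two cases described in the text.
films = pd.DataFrame({
    'budget': [200_000_000, 50_000],
    'revenue': [100_000_000, 200_000],
}, index=['big_budget', 'small_budget'])

# return = revenue / budget
films['return'] = films['revenue'] / films['budget']
print(films)
```

The big-budget movie returns 0.5 (a loss), while the small one returns 4.0 (a profit), even though its absolute revenue is far lower.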

In [11]:
df['return']=df['revenue']/df['budget']
df[df['return'].isnull()].shape
Out[11]:
(40085, 24)

We have about 5,400 movies for which both revenue and budget are known, so that the return can be computed. This is roughly 12% of the entire dataset. Although this may seem small, it is enough to perform very useful analysis and discover interesting insights about the world of movies.

In [12]:
# pd.notnull detects NaT; a comparison such as x != np.nan is always True
df['year'] = pd.to_datetime(df['release_date'], errors='coerce').apply(lambda x: str(x).split('-')[0] if pd.notnull(x) else np.nan)
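One subtlety when extracting the year: NaN never compares equal to anything, so a test of the form `x != np.nan` is true for every value and cannot detect missing dates, whereas `pd.notnull` can. A minimal sketch on made-up dates:

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: one valid date, one unparseable string, one missing value.
dates = pd.to_datetime(pd.Series(['1995-10-30', 'not a date', None]), errors='coerce')

# The comparison is True for every element, including the missing ones.
always_true = [x != np.nan for x in dates]

# pd.notnull is the reliable missing-value check.
year = dates.apply(lambda x: str(x.year) if pd.notnull(x) else np.nan)

print(always_true)
print(year.tolist())
```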
In [13]:
df['adult'].value_counts()
Out[13]:
False                                                                                                                             45454
True                                                                                                                                  9
 Rune Balot goes to a casino connected to the October corporation to try to wrap up her case once and for all.                        1
 Avalanche Sharks tells the story of a bikini contest that turns into a horrifying affair when it is hit by a shark avalanche.        1
 - Written by Ørnås                                                                                                                   1
Name: adult, dtype: int64

Only 9 movies are flagged as adult, and three rows hold stray overview text in this column, which suggests malformed records. The adult feature therefore is not of much use to us and can be safely dropped.
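Before dropping the column, the rows with unexpected values can be located with a simple membership test. This sketch assumes the flags are stored as the strings 'True' and 'False' (the column has dtype object); the series below is hypothetical.

```python
import pandas as pd

# Hypothetical adult column with one malformed entry.
adult = pd.Series(['False', 'True', 'False', ' stray overview text', 'False'])

# Assumption: valid flags are the strings 'True' and 'False'.
valid = adult.isin(['True', 'False'])
malformed_rows = adult[~valid]

print(malformed_rows)
```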

In [14]:
df = df.drop('adult', axis=1)
In [15]:
base_poster_url = 'http://image.tmdb.org/t/p/w185/'
df['poster_path'] = "<img src='" + base_poster_url + df['poster_path'] + "' style='height:100px;'>"

3- Exploratory Data Analysis

Title and Overview Wordclouds

Are there certain words that appear more often in movie titles and overviews? I suspect there are some words which are considered more potent and more worthy of a title. Let us find out!

In [16]:
df['title'] = df['title'].astype('str')
df['overview'] = df['overview'].astype('str')
In [17]:
title_corpus =' '.join(df['title'].tolist())
overview_corpus=' '.join(df['overview'].tolist())
In [18]:
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
In [19]:
title_wordcloud = WordCloud(stopwords=STOPWORDS, background_color='white', height=2000, width=4000).generate(title_corpus)
plt.figure(figsize=(16,8))
plt.imshow(title_wordcloud)
plt.axis('off')
plt.show()

The word Love is the most commonly used word in movie titles. Girl, Day, Night and Man are also among the most commonly occurring words. I think this encapsulates the idea of the ubiquitous presence of romance in movies pretty well.
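A word cloud is a visual rendering of word frequencies, so the ranking can also be checked numerically. The following is a rough sketch with the standard library; the `titles` list and the small `stopwords` set are illustrative assumptions, not the STOPWORDS set used by the wordcloud package.

```python
import re
from collections import Counter

# Sample titles and a tiny stopword set, both chosen only for illustration.
titles = ['Love Story', 'Love Actually', 'The Girl Next Door', 'Night of the Hunter']
stopwords = {'the', 'of', 'a', 'an', 'and', 'in'}

# Lowercase the joined corpus, split into words, and count the non-stopwords.
words = re.findall(r"[a-z']+", ' '.join(titles).lower())
counts = Counter(w for w in words if w not in stopwords)

top_word, top_count = counts.most_common(1)[0]
print(top_word, top_count)
```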

In [20]:
overview_wordcloud = WordCloud(stopwords=STOPWORDS, background_color='white', height=2000, width=4000).generate(overview_corpus)
plt.figure(figsize=(16,8))
plt.imshow(overview_wordcloud)
plt.axis('off')
plt.show()

Life is the most commonly used word in movie overviews. One and Find are also popular in movie blurbs. Together with Love, Man and Girl, these wordclouds give us a pretty good idea of the most popular themes present in movies.

Production Countries

The Full MovieLens Dataset consists of movies that are overwhelmingly in the English language (more than 31,000). However, these movies may have been shot in various locations around the world. It would be interesting to see which countries serve as the most popular destinations for shooting movies by filmmakers, especially those in the United States of America and the United Kingdom.

In [21]:
import ast
df['production_countries'] = df['production_countries'].fillna('[]').apply(ast.literal_eval)
df['production_countries'] = df['production_countries'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
In [22]:
s = df.apply(lambda x: pd.Series(x['production_countries']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'countries'
In [23]:
con_df = df.drop('production_countries', axis=1).join(s)
con_df = pd.DataFrame(con_df['countries'].value_counts())
con_df['country'] = con_df.index
con_df.columns = ['num_movies', 'country']
# drop=True discards the old index regardless of what it is named
con_df = con_df.reset_index(drop=True)
con_df.head(10)
Out[23]:
num_movies country
0 21153 United States of America
1 4094 United Kingdom
2 3940 France
3 2254 Germany
4 2169 Italy
5 1765 Canada
6 1648 Japan
7 964 Spain
8 912 Russia
9 828 India
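The parse, flatten and count steps above can also be written with Series.explode, which avoids building one Series per row. This is a sketch under assumptions: the toy strings mimic the stringified list-of-dicts layout, and the `iso_3166_1` key is assumed for illustration (only `name` is actually used).

```python
import ast
import pandas as pd

# Toy rows mimicking the stringified list-of-dicts layout (hypothetical values).
toy = pd.DataFrame({'production_countries': [
    "[{'iso_3166_1': 'US', 'name': 'United States of America'}]",
    "[{'iso_3166_1': 'GB', 'name': 'United Kingdom'}, "
    "{'iso_3166_1': 'US', 'name': 'United States of America'}]",
    None,
]})

# Parse each string into a list of dicts, then keep only the country names.
parsed = toy['production_countries'].fillna('[]').apply(ast.literal_eval)
names = parsed.apply(lambda items: [d['name'] for d in items])

# explode() yields one row per country; empty lists become NaN and are dropped.
counts = names.explode().dropna().value_counts()
print(counts)
```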
In [24]:
con_df = con_df[con_df['country'] != 'United States of America']
In [25]:
import plotly
import plotly.offline as py
import plotly.graph_objs as go
import plotly.tools as tls
In [26]:
data = [ dict(
        type = 'choropleth',
        locations = con_df['country'],
        locationmode = 'country names',
        z = con_df['num_movies'],
        text = con_df['country'],
        colorscale = [[0,'rgb(255, 255, 255)'],[1,'rgb(255, 0, 0)']],
        autocolorscale = False,
        reversescale = False,
        marker = dict(
            line = dict (
                color = 'rgb(180,180,180)',
                width = 0.5
            ) ),
        colorbar = dict(
            autotick = False,
            tickprefix = '',
            title = 'Production Countries'),
      ) ]

layout = dict(
    title = 'Production Countries for the MovieLens Movies (Apart from US)',
    geo = dict(
        showframe = False,
        showcoastlines = False,
        projection = dict(
            type = 'Mercator'
        )
    )
)

fig = dict( data=data, layout=layout )
py.iplot( fig, validate=False, filename='d3-world-map' )

Unsurprisingly, the United States is the most popular destination of production for movies given that our dataset largely consists of English movies. Europe is also an extremely popular location with the UK, France, Germany and Italy in the top 5. Japan and India are the most popular Asian countries when it comes to movie production.

Franchise Movies

Let us now have a brief look at Franchise movies. I was curious to discover the longest running and the most successful franchises among many other things. Let us wrangle our data to find out!

In [27]:
# .copy() yields an independent frame, so the assignment below does not trigger SettingWithCopyWarning
df_fran = df[df['belongs_to_collection'].notnull()].copy()
df_fran['belongs_to_collection'] = df_fran['belongs_to_collection'].apply(ast.literal_eval).apply(lambda x: x['name'] if isinstance(x, dict) else np.nan)
df_fran = df_fran[df_fran['belongs_to_collection'].notnull()]

In [28]:
fran_pivot = df_fran.pivot_table(index='belongs_to_collection', values='revenue', aggfunc={'revenue': ['mean', 'sum', 'count']}).reset_index()
In [29]:
fran_pivot.sort_values('sum', ascending=False).head(10)
Out[29]:
belongs_to_collection count mean sum
552 Harry Potter Collection 8 9.634209e+08 7.707367e+09
1160 Star Wars Collection 8 9.293118e+08 7.434495e+09
646 James Bond Collection 26 2.733450e+08 7.106970e+09
1317 The Fast and the Furious Collection 8 6.406373e+08 5.125099e+09
968 Pirates of the Caribbean Collection 5 9.043154e+08 4.521577e+09
1550 Transformers Collection 5 8.732202e+08 4.366101e+09
325 Despicable Me Collection 4 9.227676e+08 3.691070e+09
1491 The Twilight Collection 5 6.684215e+08 3.342107e+09
610 Ice Age Collection 5 6.433417e+08 3.216709e+09
666 Jurassic Park Collection 4 7.578710e+08 3.031484e+09

The Harry Potter franchise is the most successful movie franchise, raking in more than 7.707 billion dollars from 8 movies. The Star Wars movies come in a close second with 7.434 billion dollars, also from 8 movies. James Bond is third, but the franchise has significantly more movies than the others in the list and therefore a much smaller average gross.
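The count, mean and sum columns above can also be produced with a plain groupby. The following is a minimal sketch on made-up data; the franchise names and revenues in `toy` are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Made-up revenues for two hypothetical franchises (not taken from the dataset).
toy = pd.DataFrame({
    'belongs_to_collection': ['A Collection', 'A Collection', 'B Collection'],
    'revenue': [100.0, 300.0, 50.0],
})

# One row per franchise: number of movies, mean gross and total gross.
toy_pivot = (toy.groupby('belongs_to_collection')['revenue']
                .agg(['count', 'mean', 'sum'])
                .reset_index())

print(toy_pivot.sort_values('sum', ascending=False))
```

Sorting the same frame by 'mean' or 'count' gives the other two rankings used in this section.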

Most Successful Movie Franchises (by Average Gross)

We will use the average gross per movie to gauge the success of a movie franchise. However, this is not a very reliable metric as the revenues in this dataset have not been adjusted for inflation. Revenue statistics will therefore tend to strongly favor recent franchises.

In [30]:
fran_pivot.sort_values('mean', ascending=False).head(10)
Out[30]:
belongs_to_collection count mean sum
112 Avatar Collection 1 2.787965e+09 2.787965e+09
1245 The Avengers Collection 2 1.462481e+09 2.924962e+09
479 Frozen Collection 1 1.274219e+09 1.274219e+09
446 Finding Nemo Collection 2 9.844532e+08 1.968906e+09
1352 The Hobbit Collection 3 9.785078e+08 2.935523e+09
1388 The Lord of the Rings Collection 3 9.721816e+08 2.916545e+09
552 Harry Potter Collection 8 9.634209e+08 7.707367e+09
1160 Star Wars Collection 8 9.293118e+08 7.434495e+09
325 Despicable Me Collection 4 9.227676e+08 3.691070e+09
968 Pirates of the Caribbean Collection 5 9.043154e+08 4.521577e+09

The Avatar Collection, although it consists of just one movie at the moment, is the most successful franchise by average gross, with the sole movie raking in close to 2.8 billion dollars. Among franchises with at least 5 movies, Harry Potter is still the most successful.

Longest Running Franchises

Finally, in this subsection, let us take a look at the franchises which have stood the test of time and have managed to deliver the largest number of movies under a single banner. This metric has the advantage of not being affected by inflation. However, it does not imply that successful movie franchises tend to have more movies. Some franchises, such as Harry Potter, have a predefined storyline and it wouldn't make sense to produce more movies despite their enormous success.

In [31]:
fran_pivot.sort_values('count', ascending=False).head(10)
Out[31]:
belongs_to_collection count mean sum
646 James Bond Collection 26 2.733450e+08 7.106970e+09
473 Friday the 13th Collection 12 3.874155e+07 4.648985e+08
976 Pokémon Collection 11 6.348189e+07 6.983008e+08
552 Harry Potter Collection 8 9.634209e+08 7.707367e+09
540 Halloween Collection 8 3.089601e+07 2.471681e+08
29 A Nightmare on Elm Street Collection 8 4.544894e+07 3.635916e+08
1317 The Fast and the Furious Collection 8 6.406373e+08 5.125099e+09
1432 The Pink Panther (Original) Collection 8 2.055978e+07 1.644782e+08
1160 Star Wars Collection 8 9.293118e+08 7.434495e+09
977 Police Academy Collection 7 4.352046e+07 3.046432e+08

James Bond is the largest franchise in the dataset, with 26 movies released under the banner. Friday the 13th and Pokémon come in at a distant second and third with 12 and 11 movies respectively.

Production Companies

In [32]:
df['production_companies'] = df['production_companies'].fillna('[]').apply(ast.literal_eval)
df['production_companies'] = df['production_companies'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
In [33]:
s = df.apply(lambda x: pd.Series(x['production_companies']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'companies'
In [34]:
com_df = df.drop('production_companies', axis=1).join(s)
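The apply/stack/join pattern used above turns each list of companies into one row per company. On pandas 0.25 or later, `DataFrame.explode` does the same reshaping in one step. The sketch below uses a made-up frame (`toy`, with hypothetical studio names) and assumes that pandas version.

```python
import pandas as pd

# Hypothetical movies, each with a list of production companies.
toy = pd.DataFrame({
    'title': ['Movie A', 'Movie B', 'Movie C'],
    'production_companies': [['Studio X', 'Studio Y'], ['Studio X'], []],
    'revenue': [10.0, 20.0, 30.0],
})

# explode() emits one row per list element and keeps the original index.
# Empty lists become NaN, so those rows are dropped here.
toy_com = (toy.explode('production_companies')
              .rename(columns={'production_companies': 'companies'})
              .dropna(subset=['companies']))

print(toy_com)
```

Unlike the join used in the notebook, this sketch drops movies that have no production company; omit the `dropna` call to keep them as NaN rows.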
In [35]:
com_sum = pd.DataFrame(com_df.groupby('companies')['revenue'].sum().sort_values(ascending=False))
com_sum.columns = ['Total']
com_mean = pd.DataFrame(com_df.groupby('companies')['revenue'].mean().sort_values(ascending=False))
com_mean.columns = ['Average']
com_count = pd.DataFrame(com_df.groupby('companies')['revenue'].count().sort_values(ascending=False))
com_count.columns = ['Number']

# The three frames are sorted differently, so ask concat to align and sort the index explicitly.
com_pivot = pd.concat((com_sum, com_mean, com_count), axis=1, sort=True)

Highest Earning Production Companies

Let us find out which production companies have earned the most money from the movie making business.

In [36]:
com_pivot.sort_values('Total', ascending=False).head(10)
Out[36]:
Total Average Number
Warner Bros. 6.352519e+10 1.293792e+08 491
Universal Pictures 5.525919e+10 1.193503e+08 463
Paramount Pictures 4.880819e+10 1.235650e+08 395
Twentieth Century Fox Film Corporation 4.768775e+10 1.398468e+08 341
Walt Disney Pictures 4.083727e+10 2.778046e+08 147
Columbia Pictures 3.227974e+10 1.367785e+08 236
New Line Cinema 2.217339e+10 1.119868e+08 198
Amblin Entertainment 1.734372e+10 2.550547e+08 68
DreamWorks SKG 1.547575e+10 1.984071e+08 78
Dune Entertainment 1.500379e+10 2.419966e+08 62

Warner Bros. is the highest earning production company of all time, earning a staggering 63.5 billion dollars from close to 500 movies. Universal Pictures and Paramount Pictures are the second and the third highest earning companies with about 55.3 billion dollars and 48.8 billion dollars in revenue respectively.

Original Language

In this section, let us look at the languages of the movies in our dataset. From the production countries, we have already deduced that the majority of the movies in the dataset are English. Let us see what the other major languages represented are.

In [37]:
df['original_language'].drop_duplicates().shape[0]
Out[37]:
93
In [40]:
lang_df = pd.DataFrame(df['original_language'].value_counts())
lang_df['language'] = lang_df.index
lang_df.columns = ['number', 'language']
lang_df.head()
Out[40]:
number language
en 32269 en
fr 2438 fr
it 1529 it
ja 1350 ja
de 1080 de

There are 93 distinct language codes represented in our dataset. As we had expected, English language films form the overwhelming majority. French and Italian movies come in at a very distant second and third respectively. Let us represent the most popular languages (apart from English) in the form of a bar plot.

In [42]:
import seaborn as sns
plt.figure(figsize=(12,5))
sns.barplot(x='language', y='number', data=lang_df.iloc[1:11])
plt.show()

As mentioned earlier, French and Italian are the most commonly occurring languages after English. Japanese and Hindi form the majority as far as Asian Languages are concerned.

Popularity, Vote Average and Vote Count

In this section, we will work with metrics provided to us by TMDB users. We will try to gain a deeper understanding of the popularity, vote average and vote count features and try and deduce any relationships between them as well as other numeric features such as budget and revenue.

In [43]:
def clean_numeric(x):
    # Convert a value to float; anything that cannot be parsed becomes NaN.
    try:
        return float(x)
    except (TypeError, ValueError):
        return np.nan
In [44]:
df['popularity'] = df['popularity'].apply(clean_numeric).astype('float')
df['vote_count'] = df['vote_count'].apply(clean_numeric).astype('float')
df['vote_average'] = df['vote_average'].apply(clean_numeric).astype('float')
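The same cleaning can be done without a helper function by using `pd.to_numeric` with `errors='coerce'`, which turns unparseable values into NaN. A minimal sketch on a made-up Series (the values in `raw` are hypothetical):

```python
import pandas as pd

# A column that mixes numeric strings, stray text and a missing value.
raw = pd.Series(['7.5', '12', 'Beware Of Frost Bites', None])

# errors='coerce' replaces anything that cannot be parsed with NaN.
clean = pd.to_numeric(raw, errors='coerce')

print(clean)
```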
In [45]:
# Let us examine the summary statistics and the distribution of each feature one by one.
df['popularity'].describe()
Out[45]:
count    45460.000000
mean         2.921478
std          6.005414
min          0.000000
25%          0.385948
50%          1.127685
75%          3.678902
max        547.488298
Name: popularity, dtype: float64
In [46]:
sns.distplot(df['popularity'].fillna(df['popularity'].median()))
plt.show()
In [47]:
df['popularity'].plot(logy=True, kind='hist')
Out[47]:
<matplotlib.axes._subplots.AxesSubplot at 0x22acca014a8>
In [48]:
# Most Popular Movies by Popularity Score
df[['title', 'popularity', 'year']].sort_values('popularity', ascending=False).head(10)
Out[48]:
title popularity year
30700 Minions 547.488298 2015
33356 Wonder Woman 294.337037 2017
42222 Beauty and the Beast 287.253654 2017
43644 Baby Driver 228.032744 2017
24455 Big Hero 6 213.849907 2014
26564 Deadpool 187.860492 2016
26566 Guardians of the Galaxy Vol. 2 185.330992 2017
14551 Avatar 185.070892 2009
24351 John Wick 183.870374 2014
23675 Gone Girl 154.801009 2014

Minions is the most popular movie by the TMDB Popularity Score. Wonder Woman and Beauty and the Beast, two extremely successful woman-centric movies, come in second and third respectively.

In [49]:
df['vote_count'].describe()
Out[49]:
count    45460.000000
mean       109.897338
std        491.310374
min          0.000000
25%          3.000000
50%         10.000000
75%         34.000000
max      14075.000000
Name: vote_count, dtype: float64

As with popularity scores, the distribution of vote counts is extremely skewed, with the median vote count standing at a paltry 10 votes. The most votes a single movie has received is 14,075. TMDB votes are therefore not as reliable an indicator as their IMDB counterparts. Nevertheless, let us check which movies have received the most votes on the website.

In [50]:
df[['title', 'vote_count', 'year']].sort_values('vote_count', ascending=False).head(10)
Out[50]:
title vote_count year
15480 Inception 14075.0 2010
12481 The Dark Knight 12269.0 2008
14551 Avatar 12114.0 2009
17818 The Avengers 12000.0 2012
26564 Deadpool 11444.0 2016
22879 Interstellar 11187.0 2014
20051 Django Unchained 10297.0 2012
23753 Guardians of the Galaxy 10014.0 2014
2843 Fight Club 9678.0 1999
18244 The Hunger Games 9634.0 2012

Inception and The Dark Knight, two critically acclaimed and commercially successful Christopher Nolan movies, figure at the top of our chart.

In [51]:
df['vote_average'] = df['vote_average'].replace(0, np.nan)
df['vote_average'].describe()
Out[51]:
count    42462.000000
mean         6.014877
std          1.256208
min          0.500000
25%          5.300000
50%          6.100000
75%          6.900000
max         10.000000
Name: vote_average, dtype: float64
In [52]:
sns.distplot(df['vote_average'].fillna(df['vote_average'].median()))
Out[52]:
<matplotlib.axes._subplots.AxesSubplot at 0x22acf530048>

It appears that TMDB users are fairly strict in their ratings. The mean rating is only about 6.0 on a scale of 10, and half the movies have a rating of 6.1 or less. Let us check what the most critically acclaimed movies as per TMDB are. We will only consider those movies that have more than 2000 votes (similar in spirit to the minimum vote threshold that IMDB applies when selecting its Top 250).

Most Critically Acclaimed Movies

In [53]:
df[df['vote_count'] > 2000][['title', 'vote_average', 'vote_count' ,'year']].sort_values('vote_average', ascending=False).head(10)
Out[53]:
title vote_average vote_count year
314 The Shawshank Redemption 8.5 8358.0 1994
834 The Godfather 8.5 6024.0 1972
2211 Life Is Beautiful 8.3 3643.0 1997
5481 Spirited Away 8.3 3968.0 2001
1152 One Flew Over the Cuckoo's Nest 8.3 3001.0 1975
1176 Psycho 8.3 2405.0 1960
2843 Fight Club 8.3 9678.0 1999
1178 The Godfather: Part II 8.3 3418.0 1974
12481 The Dark Knight 8.3 12269.0 2008
292 Pulp Fiction 8.3 8670.0 1994

The Shawshank Redemption and The Godfather are the two most critically acclaimed movies in the TMDB Database. Interestingly, they are the top 2 movies in IMDB's Top 250 Movies list too. They have a rating of over 9 on IMDB as compared to their 8.5 TMDB Scores.

Do popularity and vote average share a tangible relationship? In other words, is there a strong positive correlation between these two quantities? Let us visualise their relationship in the form of a scatterplot.

In [55]:
sns.jointplot(x='vote_average', y='popularity', data=df)
Out[55]:
<seaborn.axisgrid.JointGrid at 0x22acf439550>

Surprisingly, there is no tangible correlation. In other words, popularity and vote average appear to be largely unrelated quantities. It would be interesting to discover how TMDB assigns numerical popularity scores to its movies.
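Recent seaborn releases no longer print a correlation coefficient on the joint plot, so it can help to compute one directly. The sketch below uses `Series.corr`, which computes Pearson's r by default and ignores rows where either value is missing. The numbers in `toy` are made up and chosen so that the correlation is exactly zero.

```python
import pandas as pd

# Made-up scores; the last row has a missing vote_average and is ignored by corr().
toy = pd.DataFrame({
    'vote_average': [5.0, 6.0, 7.0, 8.0, None],
    'popularity':   [2.0, 1.0, 1.0, 2.0, 9.0],
})

# Pearson correlation between the two columns (pairwise-complete observations only).
r = toy['vote_average'].corr(toy['popularity'])

print(round(r, 3))
```

On the real data the same call would be `df['vote_average'].corr(df['popularity'])`.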

In [56]:
sns.jointplot(x='vote_average', y='vote_count', data=df)
Out[56]:
<seaborn.axisgrid.JointGrid at 0x22acefbf0f0>

There is a very small correlation between Vote Count and Vote Average. A large number of votes on a particular movie does not necessarily imply that the movie is good.

Number of Movies by the year

The dataset of 45,000 movies available to us does not represent the entire corpus of movies released since the inception of cinema. However, it is reasonable to assume that it does include almost every major film released in Hollywood as well as other major film industries across the world (such as Bollywood in India). With this assumption in mind, let us take a look at the number of movies produced by the year.

In [57]:
year_count = df.groupby('year')['title'].count()
plt.figure(figsize=(18,5))
year_count.plot()
Out[57]:
<matplotlib.axes._subplots.AxesSubplot at 0x22acea26b00>

We notice that there is a sharp rise in the number of movies starting in the 1990s. However, we will not read too much into this as it is entirely possible that recent movies were oversampled for the purposes of this dataset.

Next, let us take a look at the earliest movies represented in the dataset.

Earliest Movies Represented

In [58]:
df[df['year'] != 'NaT'][['title', 'year']].sort_values('year').head(10)
Out[58]:
title year
34940 Passage of Venus 1874
34937 Sallie Gardner at a Gallop 1878
41602 Buffalo Running 1883
34933 Man Walking Around a Corner 1887
34934 Accordion Player 1888
34938 Traffic Crossing Leeds Bridge 1888
34936 Monkeyshines, No. 2 1890
34939 London's Trafalgar Square 1890
34935 Monkeyshines, No. 1 1890
41194 Mosquinha 1890

The oldest movie, Passage of Venus, was a series of photographs of the transit of the planet Venus across the Sun in 1874. They were taken in Japan by the French astronomer Pierre Janssen using his 'photographic revolver'. This is also the oldest movie on both IMDB and TMDB.

Movie Status

Although not entirely relevant to our analysis of movies, gathering information on the various kinds of movies based on their release status can provide us with interesting insight into the nature of the movies present in our dataset. My preliminary hunch was that almost every movie has the Released status. Let's find out.

In [59]:
df['status'].value_counts()
Out[59]:
Released           45014
Rumored              230
Post Production       98
In Production         20
Planned               15
Canceled               2
Name: status, dtype: int64

Almost every movie is indeed released. However, it is interesting to see that MovieLens has user ratings for movies that are still in the planning, production and post production stage. We might take this information into account while building our collaborative filtering recommendation engine.

Spoken Languages

Does the number of spoken languages influence the success of a movie? To find out, we will convert our spoken_languages feature to a numeric feature denoting the number of languages spoken in that film.

In [60]:
df['spoken_languages'] = df['spoken_languages'].fillna('[]').apply(ast.literal_eval).apply(lambda x: len(x) if isinstance(x, list) else np.nan)
df['spoken_languages'].value_counts()
Out[60]:
1     33736
2      5371
0      3835
3      1705
4       550
5       178
6        62
7        14
8         6
9         5
19        1
13        1
12        1
10        1
Name: spoken_languages, dtype: int64

Most movies have just one language spoken in the entire duration of the film. 19 is the highest number of languages spoken in a film. Let us take a look at all the films with 10 or more spoken languages.

In [61]:
df[df['spoken_languages'] >= 10][['title', 'year', 'spoken_languages']].sort_values('spoken_languages', ascending=False)
Out[61]:
title year spoken_languages
22235 Visions of Europe 2004 19
35288 The Testaments 2000 13
14093 To Each His Own Cinema 2007 12
8789 The Adventures of Picasso 1978 10

The movie with the largest number of languages, Visions of Europe, is actually a collection of 25 short films by 25 different European directors. This explains the sheer diversity of the movie in terms of language.

In [63]:
from scipy import stats
sns.jointplot(x="spoken_languages", y="return", data=df, stat_func=stats.spearmanr, color="m")
Out[63]:
<seaborn.axisgrid.JointGrid at 0x22ace9494e0>

The Spearman Coefficient is 0.018 indicating no correlation between the two quantities.
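The `stat_func` argument used above was deprecated in seaborn 0.9 and is not accepted by later releases, so the coefficient may need to be computed separately. Spearman's rho is Pearson's r applied to the ranks of the data, which the following sketch does with pandas alone. The values in `toy` are made up for illustration.

```python
import pandas as pd

# Made-up values: return mostly rises with the number of languages, with one swap.
toy = pd.DataFrame({
    'spoken_languages': [1, 2, 3, 4, 5],
    'return': [0.5, 1.2, 1.1, 3.0, 40.0],
})

# Rank both columns, then take the ordinary (Pearson) correlation of the ranks.
rho = toy['spoken_languages'].rank().corr(toy['return'].rank())

print(round(rho, 3))  # 0.9
```

Because it works on ranks, the coefficient is not dominated by extreme values such as the return of 40.0 in the last row.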

Budget

Let us now turn our attention to budget. We expect budgets to be a skewed quantity and also heavily influenced by inflation. Nevertheless, it would be interesting to gather as many insights as possible from this quantity as budget is often a critical feature in predicting movie revenue and success. As a start, let us gather the summary statistics for our budget.

In [64]:
df['budget'].describe()
Out[64]:
count    8.890000e+03
mean     2.160428e+07
std      3.431063e+07
min      1.000000e+00
25%      2.000000e+06
50%      8.000000e+06
75%      2.500000e+07
max      3.800000e+08
Name: budget, dtype: float64

The mean budget of a film is 21.6 million dollars whereas the median budget is far smaller at 8 million dollars. This strongly suggests the mean being influenced by outliers.
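A small made-up example illustrates why a gap between the mean and the median points to outliers: a single very large value pulls the mean up while leaving the median untouched. The budgets below are hypothetical.

```python
import pandas as pd

# Four modest hypothetical budgets and one blockbuster.
budgets = pd.Series([1e6, 2e6, 3e6, 4e6, 90e6])

mean_budget = budgets.mean()      # pulled up by the 90 million outlier
median_budget = budgets.median()  # unaffected by the outlier

print(mean_budget, median_budget)  # 20000000.0 3000000.0
```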

In [65]:
sns.distplot(df[df['budget'].notnull()]['budget'])
Out[65]:
<matplotlib.axes._subplots.AxesSubplot at 0x22ad2b66518>
In [66]:
df['budget'].plot(logy=True, kind='hist')
Out[66]:
<matplotlib.axes._subplots.AxesSubplot at 0x22ad2a9f748>

The distribution of movie budgets decays roughly exponentially. About 75% of the movies have a budget of 25 million dollars or less. Next, let us take a look at the most expensive movies of all time and the revenue and returns that they generated.

Most Expensive Movies of all Time

In [67]:
df[df['budget'].notnull()][['title', 'budget', 'revenue', 'return', 'year']].sort_values('budget', ascending=False).head(10)
Out[67]:
title budget revenue return year
17124 Pirates of the Caribbean: On Stranger Tides 380000000.0 1.045714e+09 2.751878 2011
11827 Pirates of the Caribbean: At World's End 300000000.0 9.610000e+08 3.203333 2007
26558 Avengers: Age of Ultron 280000000.0 1.405404e+09 5.019299 2015
11067 Superman Returns 270000000.0 3.910812e+08 1.448449 2006
44842 Transformers: The Last Knight 260000000.0 6.049421e+08 2.326701 2017
16130 Tangled 260000000.0 5.917949e+08 2.276134 2010
18685 John Carter 260000000.0 2.841391e+08 1.092843 2012
11780 Spider-Man 3 258000000.0 8.908716e+08 3.452991 2007
21175 The Lone Ranger 255000000.0 8.928991e+07 0.350157 2013
22059 The Hobbit: The Desolation of Smaug 250000000.0 9.584000e+08 3.833600 2013

Two Pirates of the Caribbean films occupy the top spots in this list with staggering budgets of 300 million dollars or more. All of the top 10 most expensive films made a profit on their investment except for The Lone Ranger, which recouped only about 35% of its investment, taking in a paltry 89 million dollars on a 255 million dollar budget.

How strong a correlation does the budget hold with the revenue? A stronger correlation would directly imply more accurate forecasts.

In [68]:
sns.jointplot(x='budget',y='revenue',data=df[df['return'].notnull()])
Out[68]:
<seaborn.axisgrid.JointGrid at 0x22ad13fb828>

The Pearson r value of 0.73 between the two quantities indicates a strong positive correlation.

Revenue

The final numeric feature we will explore is the revenue. The revenue is probably the most important numeric quantity associated with a movie. We will try to predict the revenue for movies given a set of features in a later section. The treatment of revenue will be very similar to that of budget and we will once again begin by studying the summary statistics.

In [69]:
df['revenue'].describe()
Out[69]:
count    7.408000e+03
mean     6.878739e+07
std      1.464203e+08
min      1.000000e+00
25%      2.400000e+06
50%      1.682272e+07
75%      6.722707e+07
max      2.787965e+09
Name: revenue, dtype: float64

The mean gross of a movie is 68.8 million dollars whereas the median gross is much lower at 16.8 million dollars, suggesting the skewed nature of revenue. The lowest revenue generated by a movie is just 1 dollar whereas the highest grossing movie of all time has raked in an astonishing 2.79 billion dollars.

In [70]:
sns.distplot(df[df['revenue'].notnull()]['revenue'])
Out[70]:
<matplotlib.axes._subplots.AxesSubplot at 0x22ad0cef5f8>

The distribution of revenue decays roughly exponentially, just like budget. We also found that the two quantities were strongly correlated. Let us now take a look at the highest grossing movies of all time.

In [71]:
from IPython.display import Image, HTML
In [72]:
gross_top = df[['poster_path', 'title', 'budget', 'revenue', 'year']].sort_values('revenue', ascending=False).head(10)
pd.set_option('display.max_colwidth', 100)
HTML(gross_top.to_html(escape=False))
Out[72]:
poster_path title budget revenue year
14551 Avatar 237000000.0 2.787965e+09 2009
26555 Star Wars: The Force Awakens 245000000.0 2.068224e+09 2015
1639 Titanic 200000000.0 1.845034e+09 1997
17818 The Avengers 220000000.0 1.519558e+09 2012
25084 Jurassic World 150000000.0 1.513529e+09 2015
28830 Furious 7 190000000.0 1.506249e+09 2015
26558 Avengers: Age of Ultron 280000000.0 1.405404e+09 2015
17437 Harry Potter and the Deathly Hallows: Part 2 125000000.0 1.342000e+09 2011
22110 Frozen 150000000.0 1.274219e+09 2013
42222 Beauty and the Beast 160000000.0 1.262886e+09 2017

With these analyses in place, we are in a good position to construct our correlation matrix.

In [74]:
df['year'] = df['year'].replace('NaT', np.nan)
df['year'] = df['year'].apply(clean_numeric)
In [75]:
sns.set(font_scale=1)
corr = df.corr()
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
with sns.axes_style("white"):
    plt.figure(figsize=(9,9))
    ax = sns.heatmap(corr, mask=mask, vmax=.3, square=True, annot=True)

Genres

In [77]:
df['genres'] = df['genres'].fillna('[]').apply(ast.literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
In [78]:
s = df.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'genre'
gen_df = df.drop('genres', axis=1).join(s)
gen_df['genre'].value_counts().shape[0]
Out[78]:
32

TMDB defines 32 different genres for our set of 45,000 movies. Let us now have a look at the most commonly occurring genres in movies.

In [79]:
pop_gen = pd.DataFrame(gen_df['genre'].value_counts()).reset_index()
pop_gen.columns = ['genre', 'movies']
pop_gen.head(10)
Out[79]:
genre movies
0 Drama 20265
1 Comedy 13182
2 Thriller 7624
3 Romance 6735
4 Action 6596
5 Horror 4673
6 Crime 4307
7 Documentary 3932
8 Adventure 3496
9 Science Fiction 3049
In [80]:
plt.figure(figsize=(18,8))
sns.barplot(x='genre', y='movies', data=pop_gen.head(15))
plt.show()

Drama is the most commonly occurring genre, with close to 45% of the movies identifying as drama films. Comedy comes in at a distant second, with close to 30% of the movies having adequate doses of humor. The other genres in the top 10 are Thriller, Romance, Action, Horror, Crime, Documentary, Adventure and Science Fiction.

The next question I want to answer is the trends in the share of genres of movies across the world. Has the demand for Science Fiction movies increased? Do certain years have a disproportionate share of Animation Movies? Let's find out!

We will only be looking at trends starting 2000. We will consider only those themes that appear in the top 15 most popular genres. We will exclude Documentaries, Family and Foreign Movies from our analysis.

In [81]:
genres = ['Drama', 'Comedy', 'Thriller', 'Romance', 'Action', 'Horror', 'Crime', 'Adventure', 'Science Fiction', 'Mystery', 'Fantasy', 'Animation']
In [82]:
pop_gen_movies = gen_df[(gen_df['genre'].isin(genres)) & (gen_df['year'] >= 2000) & (gen_df['year'] <= 2017)]
ctab = pd.crosstab([pop_gen_movies['year']], pop_gen_movies['genre']).apply(lambda x: x/x.sum(), axis=1)
ctab[genres].plot(kind='bar', stacked=True, colormap='jet', figsize=(12,8)).legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.title("Stacked Bar Chart of Movie Proportions by Genre")
plt.show()
In [83]:
ctab[genres].plot(kind='line', stacked=False, colormap='jet', figsize=(12,8)).legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()

The proportion of movies of each genre has remained fairly constant since the beginning of this century except for Drama. The proportion of drama films has fallen by over 5%. Thriller movies have enjoyed a slight increase in their share.
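The genre shares plotted above are row-wise proportions of a year by genre contingency table. `pd.crosstab` can compute these directly through its `normalize='index'` argument instead of applying a lambda to each row. A minimal sketch on made-up data (the rows in `toy` are hypothetical):

```python
import pandas as pd

# Hypothetical (year, genre) pairs, one row per movie-genre combination.
toy = pd.DataFrame({
    'year':  [2000, 2000, 2000, 2001, 2001],
    'genre': ['Drama', 'Drama', 'Comedy', 'Drama', 'Comedy'],
})

# normalize='index' divides every count by its row total, so each year sums to 1.
shares = pd.crosstab(toy['year'], toy['genre'], normalize='index')

print(shares)
```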

One question that I have always had is whether some genres are more successful than others. For example, we should expect Science Fiction and Fantasy movies to bring in more revenue than other genres, but when normalized by their budget, do they prove to be as successful? We will visualize two violin plots to answer this question. One will be genres versus the revenue while the other will be versus returns.

Classification:

In [108]:
cls = df[df['return'].notnull()]
cls.shape
Out[108]:
(5381, 24)
In [109]:
cls.columns
Out[109]:
Index(['belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'original_language', 'original_title', 'overview', 'popularity',
       'poster_path', 'production_companies', 'production_countries',
       'release_date', 'revenue', 'runtime', 'spoken_languages', 'status',
       'tagline', 'title', 'video', 'vote_average', 'vote_count', 'return',
       'year'],
      dtype='object')
In [110]:
cls = cls.drop(['id', 'overview', 'poster_path', 'release_date', 'status', 'tagline', 'revenue'], axis=1)

Let us convert our return feature into a binary variable that will serve as our classes: 0 indicating a flop and 1 indicating a hit.

In [111]:
cls['return'] = cls['return'].apply(lambda x: 1 if x >=1 else 0)
cls['return'].value_counts()
Out[111]:
1    3776
0    1605
Name: return, dtype: int64

Roughly 70% of the movies are hits and 30% are flops. This imbalance is moderate, so we will not apply any additional methods to deal with it. Let us now turn our attention to our features.
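The labelling rule and the class shares can be checked on a handful of made-up return values. In the sketch below, `returns` is a hypothetical Series; a return of exactly 1 (break-even) counts as a hit, matching the `x >= 1` rule used above.

```python
import pandas as pd

# Hypothetical return ratios (revenue divided by budget).
returns = pd.Series([0.35, 1.0, 2.75, 5.02, 0.9])

# 1 = hit (recovered at least its budget), 0 = flop.
labels = (returns >= 1).astype(int)

# normalize=True reports class shares instead of raw counts.
shares = labels.value_counts(normalize=True)

print(shares)
```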

In [112]:
cls['belongs_to_collection'] = cls['belongs_to_collection'].fillna('').apply(lambda x: 0 if x == '' else 1)
In [113]:
sns.set(style="whitegrid")
g = sns.PairGrid(data=cls, x_vars=['belongs_to_collection'], y_vars='return', height=5)
g.map(sns.pointplot, color=sns.xkcd_rgb["plum"])
g.set(ylim=(0, 1))
Out[113]:
<seaborn.axisgrid.PairGrid at 0x22acf96b080>

It seems that movies that belong to a franchise have a higher probability of being a success.

In [116]:
cls['homepage'] = cls['homepage'].fillna('').apply(lambda x: 0 if x == '' else 1)
g = sns.PairGrid(data=cls, x_vars=['homepage'], y_vars='return', height=5)
g.map(sns.pointplot, color=sns.xkcd_rgb["plum"])
g.set(ylim=(0, 1))
Out[116]:
<seaborn.axisgrid.PairGrid at 0x22ac5a6d438>

We see that having a homepage makes little difference to the probability of success. To avoid the curse of dimensionality, we will eliminate this feature as it is not very useful.

In [117]:
s = cls.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'genre'
gen_cls = cls.drop('genres', axis=1).join(s)
In [118]:
ctab = pd.crosstab([gen_cls['genre']], gen_cls['return'], dropna=False).apply(lambda x: x/x.sum(), axis=1)
ctab.plot(kind='bar', stacked=True, legend=False)
Out[118]:
<matplotlib.axes._subplots.AxesSubplot at 0x22ac2d22c88>

We find that TV Movies have a 0% failure rate, but that is most probably because they are extremely few in number. Foreign films have a higher rate of failure than average. Since no particular genre stands out drastically, we will proceed with one hot encoding all genres.
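One hot encoding a list-valued genre column means adding one 0/1 indicator column per genre. The following is a minimal sketch on a made-up frame; `toy` and `all_genres` are hypothetical names, and the `is_` prefix simply mirrors the column names that appear later in this notebook.

```python
import pandas as pd

# Hypothetical movies, each with a list of genres (possibly empty).
toy = pd.DataFrame({'genres': [['Drama', 'Comedy'], ['Action'], []]})

# Collect every genre that appears anywhere in the column.
all_genres = sorted({g for row in toy['genres'] for g in row})

# One indicator column per genre: 1 if the movie lists it, 0 otherwise.
for genre in all_genres:
    toy['is_' + genre] = toy['genres'].apply(lambda x, g=genre: int(g in x))

print(toy)
```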

In [119]:
cls.columns
Out[119]:
Index(['belongs_to_collection', 'budget', 'genres', 'homepage',
       'original_language', 'original_title', 'popularity',
       'production_companies', 'production_countries', 'runtime',
       'spoken_languages', 'title', 'video', 'vote_average', 'vote_count',
       'return', 'year'],
      dtype='object')
In [120]:
def classification_engineering(df):
    for genre in genres_train:
        df['is_' + str(genre)] = df['genres'].apply(lambda x: 1 if genre in x else 0)
    df['genres'] = df['genres'].apply(lambda x: len(x))
    df = df.drop('homepage', axis=1)
    df['is_english'] = df['original_language'].apply(lambda x: 1 if x=='en' else 0)
    df = df.drop('original_language', axis=1)
    df['production_companies'] = df['production_companies'].apply(lambda x: len(x))
    df['production_countries'] = df['production_countries'].apply(lambda x: len(x))
    df['is_Friday'] = df['day'].apply(lambda x: 1 if x=='Fri' else 0)
    df = df.drop('day', axis=1)
    df['is_Holiday'] = df['month'].apply(lambda x: 1 if x in ['Apr', 'May', 'Jun', 'Nov'] else 0)
    df = df.drop('month', axis=1)
    df = df.drop(['title', 'cast', 'director'], axis=1)
    #df = pd.get_dummies(df, prefix='is')
    df['runtime'] = df['runtime'].fillna(df['runtime'].mean())
    df['vote_average'] = df['vote_average'].fillna(df['vote_average'].mean())
    df = df.drop('crew', axis=1)
    return df
In [122]:
cls.columns
Out[122]:
Index(['belongs_to_collection', 'budget', 'genres', 'homepage',
       'original_language', 'original_title', 'popularity',
       'production_companies', 'production_countries', 'runtime',
       'spoken_languages', 'title', 'video', 'vote_average', 'vote_count',
       'return', 'year', 'is_Animation', 'is_Comedy', 'is_Family',
       'is_Adventure', 'is_Fantasy', 'is_Drama', 'is_Romance', 'is_Action',
       'is_Crime', 'is_Thriller', 'is_History', 'is_Science Fiction',
       'is_Mystery', 'is_Horror', 'is_War', 'is_Foreign', 'is_Documentary',
       'is_Western', 'is_Music', 'is_nan', 'is_TV Movie'],
      dtype='object')
In [ ]:
cls = classification_engineering(cls)
In [123]:
from sklearn.model_selection import train_test_split
X, y = cls.drop('return', axis=1), cls['return']
train_X, test_X, train_y, test_y = train_test_split(X, y, train_size=0.75, test_size=0.25, stratify=y)
In [ ]:
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier()
clf.fit(train_X, train_y)
clf.score(test_X, test_y)
In [ ]:
plt.figure(figsize=(10,12))
sns.barplot(x=clf.feature_importances_, y=X.columns)

We see that Vote Count is once again the most significant feature identified by our Classifier. Other important features include Budget, Popularity and Year. With this, we will conclude our discussion on the classification model and move on to the main part of the project.

Regression: Predicting Movie Revenues

In [127]:
rgf = df[df['return'].notnull()]
rgf.shape
Out[127]:
(5381, 24)

We have 5381 records in our training set. Let us take a look at the features we possess and remove the ones that are unnecessary.

In [128]:
rgf.columns
Out[128]:
Index(['belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'original_language', 'original_title', 'overview', 'popularity',
       'poster_path', 'production_companies', 'production_countries',
       'release_date', 'revenue', 'runtime', 'spoken_languages', 'status',
       'tagline', 'title', 'video', 'vote_average', 'vote_count', 'return',
       'year'],
      dtype='object')
In [129]:
rgf = rgf.drop(['id', 'overview', 'poster_path', 'release_date', 'status', 'tagline', 'video', 'return'], axis=1)

We will perform the following feature engineering tasks:

  • belongs_to_collection will be turned into a Boolean variable. 1 indicates that a movie is part of a collection, whereas 0 indicates it is not.
  • genres will be converted into number of genres.
  • homepage will be converted into a Boolean variable that will indicate if a movie has a homepage or not.
  • original_language will be replaced by a feature called is_english to denote whether a particular film is in English or a foreign language.
  • production_companies will be replaced with just the number of production companies collaborating to make the movie.
  • production_countries will be replaced with the number of countries the film was shot in.
  • day will be converted into a binary feature to indicate if the film was released on a Friday.
  • month will be converted into a variable that indicates if the month was a holiday season.
In [130]:
s = rgf.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'genre'
gen_rgf = rgf.drop('genres', axis=1).join(s)
genres_train = gen_rgf['genre'].drop_duplicates()
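
The cell above collects every distinct genre label by stacking the per-row lists. On pandas 0.25 or later, a similar result can be obtained with Series.explode. The sketch below is only an illustration: it assumes each genres entry is already a Python list of genre names, and it uses a made-up toy frame (toy, genres_toy) rather than the real data.

```python
import pandas as pd

# Toy stand-in for rgf; the titles and genre lists are made up for illustration.
toy = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'genres': [['Comedy', 'Drama'], ['Drama'], []],
})

# explode() emits one row per list element; an empty list becomes NaN,
# so dropna() removes it before de-duplicating.
genres_toy = toy['genres'].explode().dropna().drop_duplicates()
print(genres_toy.tolist())  # ['Comedy', 'Drama']
```

Whether to keep or drop the NaN produced by empty genre lists is a design choice; dropping it avoids creating an indicator column for a missing genre.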
In [131]:
def feature_engineering(df):
    # NaN never equals anything (not even itself), so test for missing values with notnull()
    df['belongs_to_collection'] = df['belongs_to_collection'].notnull().astype(int)
    for genre in genres_train:
        df['is_' + str(genre)] = df['genres'].apply(lambda x: 1 if genre in x else 0)
    df['genres'] = df['genres'].apply(lambda x: len(x))
    # 1 if the movie has a homepage, 0 if the value is missing
    df['homepage'] = df['homepage'].notnull().astype(int)
    df['is_english'] = df['original_language'].apply(lambda x: 1 if x=='en' else 0)
    df = df.drop('original_language', axis=1)
    df['production_companies'] = df['production_companies'].apply(lambda x: len(x))
    df['production_countries'] = df['production_countries'].apply(lambda x: len(x))
    df['is_Friday'] = df['day'].apply(lambda x: 1 if x=='Fri' else 0)
    df = df.drop('day', axis=1)
    df['is_Holiday'] = df['month'].apply(lambda x: 1 if x in ['Apr', 'May', 'Jun', 'Nov'] else 0)
    df = df.drop('month', axis=1)
    df = df.drop(['title', 'cast', 'director'], axis=1)
    df = pd.get_dummies(df, prefix='is')
    df['runtime'] = df['runtime'].fillna(df['runtime'].mean())
    df['vote_average'] = df['vote_average'].fillna(df['vote_average'].mean())
    return df
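
A note on the missing-value flags for belongs_to_collection and homepage: NaN never compares equal to anything, including itself, so a test of the form x == np.nan is False for every row. The sketch below contrasts that comparison with notnull() on a made-up Series (the URL and the names homepage, flag_eq and flag_notnull are illustrative assumptions, not part of the dataset).

```python
import numpy as np
import pandas as pd

# Made-up homepage column: one real-looking URL and two missing values.
homepage = pd.Series(['http://example.com', np.nan, None])

# Equality against NaN is always False, so every row is flagged as 1.
flag_eq = homepage.apply(lambda x: 0 if x == np.nan else 1)

# notnull() detects NaN/None; astype(int) turns the mask into a 0/1 flag.
flag_notnull = homepage.notnull().astype(int)

print(flag_eq.tolist())       # [1, 1, 1]
print(flag_notnull.tolist())  # [1, 0, 0]
```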
In [139]:
X, y = rgf.drop('revenue', axis=1), rgf['revenue']
In [140]:
train_X, test_X, train_y, test_y = train_test_split(X, y, train_size=0.75, test_size=0.25)
X.shape
Out[140]:
(5381, 15)
In [ ]:
reg = GradientBoostingRegressor()
reg.fit(train_X, train_y)
reg.score(test_X, test_y)

We get a coefficient of determination of 0.78, which is a pretty good score for the basic model that we have built. Let us compare our model's score to that of a Dummy Regressor.

In [142]:
dummy = DummyRegressor()
dummy.fit(train_X, train_y)
dummy.score(test_X, test_y)
Out[142]:
-0.00019108364622710816
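
A score this close to zero is what a mean baseline should give. With its default strategy, DummyRegressor predicts the mean of the training targets for every sample, and score returns the coefficient of determination R^2 = 1 - SS_res / SS_tot. A constant prediction scores 0 only when it equals the mean of the test targets, and is negative otherwise. A minimal sketch in plain Python, using made-up numbers rather than the movie data (r2_score_manual is a hypothetical helper, not a library function):

```python
# Made-up revenue-like targets, for illustration only.
train_y = [10.0, 20.0, 30.0, 40.0]
test_y = [15.0, 25.0, 50.0]

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Mean baseline: every prediction is the training mean (25.0 here).
baseline = sum(train_y) / len(train_y)
baseline_r2 = r2_score_manual(test_y, [baseline] * len(test_y))
print(baseline_r2)  # negative, because 25.0 differs from the test mean of 30.0
```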

We see that our model performs far better than the Dummy Regressor. Finally, let us plot the feature importances as a bar plot to see which features were the most significant in making our predictions.

In [ ]:
sns.set_style('whitegrid')
plt.figure(figsize=(10,12))
sns.barplot(x=reg.feature_importances_, y=X.columns)

We notice that vote_count, a feature we cheated with, is the most important feature to our Gradient Boosting model. This goes to show the importance of popularity metrics in determining the revenue of a movie. Budget was the second most important feature, followed by popularity (literally, a popularity metric).
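
Rankings like this are easier to read off when the importances are sorted before plotting. The sketch below shows one way to do that with a pandas Series. The feature names and values are hypothetical placeholders; in the notebook they would come from X.columns and reg.feature_importances_.

```python
import pandas as pd

# Hypothetical feature names and importances, for illustration only.
columns = ['feature_a', 'feature_b', 'feature_c']
importances = [0.2, 0.7, 0.1]

# Pair each importance with its feature name and sort from largest to smallest.
ranked = pd.Series(importances, index=columns).sort_values(ascending=False)
print(ranked.index.tolist())  # ['feature_b', 'feature_a', 'feature_c']
```

The sorted Series can then be passed to the same bar plot call, for example sns.barplot(x=ranked.values, y=ranked.index).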